Shrink - Prescribing Resiliency Solutions for Streaming
Authors
Abstract
Streaming query deployments make up a vital part of cloud oriented applications. They vary widely in their data, logic, and statefulness, and are typically executed in multi-tenant distributed environments with varying uptime SLAs. In order to achieve these SLAs, one of a number of proposed resiliency strategies is employed to protect against failure. This paper introduces the first comprehensive, cloud friendly comparison between different resiliency techniques for streaming queries. In this paper, we introduce models which capture the costs associated with different resiliency strategies, and through a series of experiments which implement and validate these models, show that (1) there is no single resiliency strategy which efficiently handles most streaming scenarios; (2) the optimization space is too complex for a person to employ a "rules of thumb" approach; and (3) there exists a clear generalization of periodic checkpointing that is worth considering in many cases. Finally, the models presented in this paper can be adapted to fit a wide variety of resiliency strategies, and likely have important consequences for cloud services beyond those that are obviously streaming.

INTRODUCTION

Streaming query deployments make up a vital part of cloud oriented applications, like online advertising, online analytics, and internet of things scenarios. They vary widely in their data, logic, and statefulness, and are typically executed in multi-tenant distributed environments with varying uptime service level agreements (SLAs), i.e., how often query response time is impacted by failure.

[Figure 1: Typical Streaming Query Deployment (ingress node, compute node, storage nodes 1 to K, input)]

For instance, consider a typical deployment of a streaming query, shown in Figure 1. In this figure, input arrives at or is born at the ingress node. Input is then typically journaled (written) to replicated storage for later analysis, and is therefore sent to multiple storage nodes.
The actual streaming computation is performed at the compute node, which may also be running other jobs. Note that compute nodes typically perform stateful computations, like windowed aggregates, which require that various counters and data structures be maintained in memory over time. As these queries are very long running, nodes eventually fail, and one of a number of proposed resiliency strategies [11] is employed to protect against failure.

Unfortunately, the choice of resiliency strategy is highly challenging and scenario dependent. For instance, consider the system described in MillWheel [13]. This system periodically checkpoints the query state, and optionally allows users to implement caching, which is highly useful for scenarios like online advertising. In such scenarios, the event rate is small to moderate (e.g., tens of thousands of events per second), and there are a very large number of states (e.g., one for each browsing session) which are active for a short period of time, then typically expire after a long holding period. Rather than redundantly storing states in compute node RAM, states are cached in the streaming nodes for a period, then sent to a key-value store, where they are written in replicated fashion to cheap storage, and typically expire unaccessed. As a result, the RAM needed for streaming nodes is small, and may be checkpointed and recovered cheaply.

This design would, however, be untenable for online gaming, where the event rate is high (e.g., millions of events per second), with a large number of active users, and with little locality for a cache to leverage. The tolerance for recovery latency is very low, making it impossible to recover a failed node quickly enough.
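The caching design described above can be illustrated with a minimal sketch. It is not MillWheel's API or the paper's model: the class `StreamingNode`, its methods, the `idle_limit` parameter, and the use of a plain dictionary as the key-value store are all hypothetical names chosen for illustration, and the per-key state is assumed to be a simple running sum.

```python
import copy


class StreamingNode:
    """Toy compute node (hypothetical): hot states stay in RAM, idle ones go to a key-value store."""

    def __init__(self, kv_store, idle_limit):
        self.kv_store = kv_store      # stand-in for replicated, cheap storage
        self.idle_limit = idle_limit  # how long a state may sit unused in the cache
        self.cache = {}               # key -> (state, last_access_time)

    def process(self, key, value, now):
        # Use the cached state if present, else fall back to the store, else start at 0.
        if key in self.cache:
            state, _ = self.cache[key]
        else:
            state = self.kv_store.get(key, 0)
        self.cache[key] = (state + value, now)

    def evict_idle(self, now):
        # Move states that have not been touched for idle_limit time units out of RAM.
        idle_keys = [k for k, (_, t) in self.cache.items() if now - t >= self.idle_limit]
        for key in idle_keys:
            state, _ = self.cache.pop(key)
            self.kv_store[key] = state

    def checkpoint(self):
        # Only the (small) in-memory cache is captured.
        return copy.deepcopy(self.cache)

    def restore(self, snapshot):
        # Replace the in-memory cache with a previously taken snapshot.
        self.cache = copy.deepcopy(snapshot)


kv = {}
node = StreamingNode(kv, idle_limit=10)
node.process("session-a", 1, now=0)
node.process("session-b", 5, now=8)
node.evict_idle(now=12)              # session-a has been idle for 12 time units
snapshot = node.checkpoint()         # contains only session-b
node.process("session-a", 2, now=13) # session-a is reloaded from the store
```

Because idle states leave RAM, the snapshot stays small, which is the property the text attributes to this design. The sketch does not model replay of events that arrive after a snapshot, nor replication of the store.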
While many streaming resiliency strategies are discussed in the literature, along with some modeling work, the state of the art does not quantify the performance and resource cost tradeoffs across even basic strategies in a way which is actionable in today's cloud environments. For instance, prior efforts (e.g., [11]) do not consider uptime SLAs and resource reservation costs, leading to analyses useful for establishing some intuition for the differences between approaches, but not for selecting strategies in today's datacenter oriented applications. Lacking tools or frameworks sufficient to prescribe resiliency approaches, practitioners typically choose the technique which is easiest to implement, or in cases like MillWheel, build systems tailored to solve particular classes of problems, hoping that these systems will have high general applicability.

This paper presents an analytical framework based on uptime SLAs and resource reservation, as well as detailed analyses of a number of resiliency designs for streaming systems. We show:

One size doesn't fit all: There is no resiliency strategy which efficiently covers most of the streaming query space. Specific strategies can be vastly better compared to others (by orders ...

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 10, No. 5. Copyright 2017 VLDB Endowment 2150-8097/17/01.
Similar resources
A Method to Reduce Effects of Packet Loss in Video Streaming Using Multiple Description Coding
Multiple description (MD) coding has evolved as a promising technique for promoting error resiliency of multimedia system in real-time application programs over error-prone communicational channels. Although multiple description lattice vector quantization (MDCLVQ) is an efficient method for transmitting reliable data in the context of potential error channels, this method doesn’t consider disc...
Hybrid algorithms for Job shop Scheduling Problem with Lot streaming and A Parallel Assembly Stage
In this paper, a Job shop scheduling problem with a parallel assembly stage and Lot Streaming (LS) is considered for the first time in both machining and assembly stages. Lot Streaming technique is a process of splitting jobs into smaller sub-jobs such that successive operations can be overlapped. Hence, to solve job shop scheduling problem with a parallel assembly stage and lot streaming, deci...
Critical Path Method for Flexible Job Shop Scheduling Problem with Preemption
This paper addresses a Flexible Job shop Scheduling Problem (FJSP) with the objective of minimizing the maximum completion time (Cmax), in which job splitting or lot streaming is allowed. Lot streaming is an important technique that has been used widely to reduce the completion time of a production system. Due to the complexity of the problem, exact optimization techniques such as branch and bound alg...
Peer-to-peer media streaming based on network coding over random multicast trees
Network coding is known to provide increased throughput and reduced delay for networked communications. In this paper we propose a peer-to-peer media streaming system that exploits network coding in order to achieve low start-up delay, high streaming rate, and high resiliency to peers’ and network dynamics, such as ungraceful peer departures, and delays or packet losses. To achieve this objecti...
Error resiliency schemes in H.264/AVC standard
Real-time transmission of video data in network environments, such as wireless and Internet, is a challenging task, as it requires high compression efficiency and network friendly design. H.264/AVC is the newest international video coding standard, jointly developed by groups from ISO/IEC and ITU-T, which aims at achieving improved compression performance and a network-friendly video representa...
Journal title:
- PVLDB
Volume 10, Issue
Pages -
Publication date: 2017